A Modern Computational Linguistics Course Using Dutch

Author

Gosse Bouma

Abstract

This paper describes material for a course in computational linguistics which concentrates on building (parts of) realistic language technology applications for Dutch. We present an overview of the reasons for developing new material, rather than using existing textbooks. Next we present an overview of the course in the form of six exercises, covering advanced use of finite state methods, grammar development, and natural language interfaces. The exercises emphasise the benefits of special-purpose development tools, the importance of testing on realistic data sets, and the possibilities for web applications based on natural language processing.

1 Introduction

This paper describes a set of exercises in computational linguistics. The material was primarily developed for two courses: a general introduction to computational linguistics, and a more advanced course focusing on natural language interfaces. Students who enter the first course have a background in either humanities computing or cognitive science. This implies that they possess some general programming skills and that they have at least some knowledge of general linguistics. Furthermore, all students entering the course are familiar with logic programming and Prolog. The native language of practically all students is Dutch.

The aim of the introductory course is to provide an overview of language technology applications and of the concepts and techniques used to develop such applications, and to let students gain practical experience in developing (components of) these applications. The second course focuses on computational semantics and the construction of natural language interfaces using computational grammars.

Course material for computational linguistics exists primarily in the form of textbooks, such as Allen (1987), Gazdar and Mellish (1989) and Covington (1994).
They focus primarily on basic concepts and techniques (finite state automata, definite clause grammar, parsing algorithms, construction of semantic representations, etc.) and the implementation of toy systems for experimenting with these techniques. If courseware is provided, it consists of the code and grammar fragments discussed in the text material. The language used for illustration is primarily English.

While attention to basic concepts and techniques is indispensable for any course in this field, one may wonder whether implementation issues need to be as prominent as they are in the textbooks of, say, Gazdar and Mellish (1989) and Covington (1994). Developing natural language applications from scratch may lead to maximal control and understanding, but it is also time-consuming, requires good programming skills rather than insight into natural language phenomena, and, in tutorial settings, is restricted to toy systems. These are disadvantages for an introductory course in particular.

In such a course, an attractive alternative is to skip most of the implementation issues, and focus instead on what can be achieved if one has the right tools and data available. The advantage is that the emphasis shifts naturally to a situation where students must concentrate primarily on developing accounts for linguistic data, on exploring data available in the form of corpora or word lists, and on using real high-level tools. Consequently, it becomes feasible to consider not only toy systems and toy fragments, but to develop more or less realistic components of natural language applications. As the target language of the course is Dutch, this also implies that at least some attention has to be paid to specific properties of Dutch grammar, and to (electronic) linguistic resources for Dutch. Since students nowadays have access to powerful hardware, and both tools and data can be distributed easily over the internet, there are no real practical obstacles.
Textbooks which are concerned primarily with computational semantics and natural language interfaces, such as Pereira and Shieber (1987) and Blackburn and Bos (1998), tend to introduce a toy domain, such as a geography database or an excerpt of a movie script, as application area. In trying to develop exercises which are closer to real applications, we have explored the possibilities of using web-accessible databases as back-end for a natural language interface program.

More specifically, we hope to achieve the following:

• Students learn to use high-level tools. The development of a component for morphological analysis requires far more than what can be achieved by specifying and implementing the underlying finite state automata directly. Rather, abstract descriptions of morphological rules should be possible, and software should be provided to support development and debugging. Similarly, while a programming language such as Prolog offers possibilities for relatively high-level descriptions of natural language grammars, the advantages of specialised languages for implementing unification-based grammars and accompanying tools are obvious. Furthermore, the availability of graphical interfaces and visualisation in tutorial situations is a bonus which should not be underestimated.
• Students learn to work with real data. In developing practical, robust, wide-coverage language technology applications, researchers have found that the use of corpora and electronic dictionaries is absolutely indispensable. Students should gain at least some familiarity with such sources, learn how to search large data sets, and learn how to deal with exceptions, errors, or unclear cases in real data.
• Students become familiar with quantitative evaluation methods. One advantage of developing components using real data is that one can use the evaluation metrics dominant in most current computational linguistics research.
That is, an implementation of a hyphenation rule or a grammar for temporal expressions can be tested by measuring its accuracy on a list of unseen words or utterances. This provides insight into the difficulty of solving similar problems in a robust fashion for unrestricted text.
• Students develop language technology components for Dutch. In teaching computational linguistics to students whose native language is not English, it is common practice to focus primarily on the question of how the (English) examples in the textbook can be carried over to a grammar for one's own language. As this may take considerable time and effort, more advanced topics are usually skipped. In a course which aims primarily at Dutch, and which also contains material describing some of the peculiarities of this language (hyphenation rules, spelling rules relevant to morphology, word order in main and subordinate clauses, verb clusters), there is room for developing more elaborate and extended components.
• Students develop realistic applications. The use of tools and real data makes it easier to develop components which are robust and which have relatively good coverage. Applications in the area of computational semantics can be made more interesting by exploiting the possibilities offered by the internet. The growing amount of information available on the internet provides opportunities for accessing much larger databases (such as public transport timetables or library catalogues), and therefore for developing more realistic applications.

The sections below are primarily concerned with a number of exercises we have developed to achieve the goals mentioned above. An accompanying text is under development.¹

2 Finite State Methods

A typical course in computational linguistics starts with finite state methods. Finite state techniques can provide computationally efficient solutions to a wide range of tasks in natural language processing.
Therefore, students should be familiar with the basic concepts of automata (states and transitions, recognizers and transducers, properties of automata) and should know how to solve

¹ See www.let.rug.nl/~gosse/tt for a preliminary version of the text and links to the exercises described
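
The automata concepts mentioned in Section 2 (states, transitions, recognizers) and the idea of measuring accuracy on a word list can be illustrated with a minimal sketch. It is not taken from the course material, which builds on Prolog and special-purpose tools. The consonant-vowel pattern, the names `TRANSITIONS`, `accepts` and `accuracy`, and the labelled word list are all hypothetical, chosen only for illustration; the pattern makes no claim about Dutch phonology.

```python
# Toy deterministic finite state recognizer (illustrative assumption, not the
# course's own tooling). It accepts words that strictly alternate
# consonant, vowel, consonant, ... starting with a consonant.

VOWELS = set("aeiou")


def symbol_class(ch):
    """Map a character to the toy alphabet: 'V' for a vowel, 'C' otherwise."""
    return "V" if ch in VOWELS else "C"


# Transition table: (state, symbol class) -> next state.
# 0 = start, 1 = after initial consonant, 2 = after a vowel, 3 = after a
# consonant that follows a vowel. Missing entries mean "reject".
TRANSITIONS = {
    (0, "C"): 1,
    (1, "V"): 2,
    (2, "C"): 3,
    (3, "V"): 2,
}
FINAL_STATES = {2, 3}


def accepts(word):
    """Return True if the recognizer ends in a final state after reading word."""
    state = 0
    for ch in word.lower():
        key = (state, symbol_class(ch))
        if key not in TRANSITIONS:
            return False
        state = TRANSITIONS[key]
    return state in FINAL_STATES


def accuracy(predict, labelled_items):
    """Fraction of (item, gold_label) pairs on which predict agrees with gold."""
    correct = sum(1 for item, gold in labelled_items if predict(item) == gold)
    return correct / len(labelled_items)


# Hypothetical gold labels, invented for this sketch only.
TEST_ITEMS = [("kat", True), ("tafel", True), ("boek", True), ("xq", False)]

if __name__ == "__main__":
    print(accuracy(accepts, TEST_ITEMS))
```

The recognizer rejects "boek" because two vowels in a row have no transition in the table, so it disagrees with the invented gold label on one of the four items. Evaluating on unseen data exposes exactly this kind of mismatch between a rule set and real words.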

Similar resources

Building a corpus of spoken Dutch

In this paper the Spoken Dutch Corpus Project is presented, a joint Flemish-Dutch undertaking aimed at the compilation and annotation of a 10-million-word corpus of spoken Dutch. Upon completion, the corpus will constitute a valuable resource for research in the fields of computational linguistics and language and speech technology. The paper first gives an overview of the project. It then goes ...

Zero to Spoken Dialogue System in One Quarter: Teaching Computational Linguistics to Linguists Using Regulus

This paper describes a Computational Linguistics course designed for Linguistics students. The course is structured around the architecture of a Spoken Dialogue System and makes extensive use of the dialogue system tools and examples available in the Regulus Open Source Project. Although only a quarter long course, students learn Computational Linguistics and programming sufficient to build the...

The Impact of the Belgian-Dutch Merge

Undoubtedly the reader has noticed the final step in the merge of the Belgian and Dutch AI communities: the official BNVKI/AIABN logo on the cover of this newsletter. In the last year, the newsletter has made a gradual shift from a mainly-Dutch “nieuwsbrief” to a mainly-English (and hopefully soon, completely-English) newsletter. The impact of the merge is clearly visible in this issue of the f...

Spontaneous Speech in the Spoken Dutch Corpus

In this paper the Spoken Dutch Corpus project is presented, a joint Flemish-Dutch undertaking aimed at the compilation and annotation of a corpus of 1,000 hours of spoken Dutch. Upon completion, the corpus will constitute a valuable resource for research in the fields of (computational) linguistics and language and speech technology. Although the corpus will contain a fair amount of read speech...

bcn Algorithms for Linguistic Processing

Algorithms for Linguistic Processing is a research proposal in the area of computational linguistics. The proposal focuses on problems of ambiguity and processing efficiency by investigating grammar approximation and grammar specialization techniques. In this section we will try to explain for a broader audience what computational linguistics is, and what the curren...

Publication date: 1999
